Goal of the notebook: I will use Bokeh to visualize data I have collected. The data is a CSV file built from generated questions that were run through different question-answering models, based on a lunch form. I will try out different plots to see which ones are best suited to the audience I am showing the data and results to.
Side note: Later in the notebook I switch from Bokeh to Plotly. I made the switch because Plotly is an easier-to-use visualization tool, and it is also new to me.
import pandas as pd
import os
import numpy as np
# Circle
from math import pi
from bokeh.palettes import Category20c
from bokeh.transform import cumsum
from bokeh.plotting import figure, output_notebook, show
from squarify import normalize_sizes, squarify
from bokeh.sampledata.sample_superstore import data
from bokeh.transform import factor_cmap
import plotly.express as px
import plotly
The data used here comes from CSV files built from the generated questions that were run through different question-answering models. It combines the answers that the different models predicted for each label of a form.
df = pd.read_csv(r'C:\Users\victo\source\repos\Semester 7\JupyterLab\Group\Question Generator\csv_ouput\df_merged.csv', index_col=[0])
# To drop a single column such as 'Unnamed: 0', refer to it by name
# df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()
| | label | questions | answer | score | model | percentage | actual_answer | correctly_predicted | occurence |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Number of Attendees | what Number of Attendees? | 15 | 62.77 | model 1 | 1.539416 | 15 | True | 1 |
| 1 | Number of Attendees | who Number of Attendees? | 15 | 67.06 | model 1 | 1.644627 | 15 | True | 1 |
| 2 | Number of Attendees | where Number of Attendees? | 15 | 46.99 | model 1 | 1.152416 | 15 | True | 1 |
| 3 | Number of Attendees | when Number of Attendees? | 15 | 55.51 | model 1 | 1.361367 | 15 | True | 1 |
| 4 | Number of Attendees | why Number of Attendees? | 15 | 55.14 | model 1 | 1.352293 | 15 | True | 1 |
In this chapter I explore the data using Bokeh.
First, I visualize the different labels that are used in the dataset.
x = df.label.value_counts()
data = pd.Series(x).reset_index(name='value').rename(columns={'index': 'country'})
data['angle'] = data['value']/data['value'].sum() * 2*pi
data['color'] = Category20c[len(x)]
data
| | country | value | angle | color |
|---|---|---|---|---|
| 0 | Number of Attendees | 59 | 1.256637 | #3182bd |
| 1 | Date | 55 | 1.171441 | #6baed6 |
| 2 | End Time | 43 | 0.915854 | #9ecae1 |
| 3 | Start Time | 41 | 0.873256 | #c6dbef |
| 4 | Budget | 27 | 0.575071 | #e6550d |
| 5 | Contact Details | 24 | 0.511174 | #fd8d3c |
| 6 | Organizer | 20 | 0.425979 | #fdae6b |
| 7 | Location | 10 | 0.212989 | #fdd0a2 |
| 8 | Food Allergies | 9 | 0.191690 | #31a354 |
| 9 | Food Diets | 7 | 0.149093 | #74c476 |
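The `angle` column turns each label's count into its share of the full circle (2π radians). In the table above the counts add up to 295, so "Number of Attendees" gets 59/295 × 2π ≈ 1.2566. Below is a minimal sketch of the same arithmetic on made-up counts (the two labels and their values are hypothetical, not taken from the dataset). It also shows how cumulative sums give each wedge its start and end angle, which is my understanding of what `cumsum('angle', include_zero=True)` and `cumsum('angle')` do in the plot.

```python
from math import pi

import pandas as pd

# Hypothetical label counts, only for illustration
counts = pd.Series({"Date": 3, "Budget": 1}, name="value")

# Share of the full circle (2*pi radians) for each label
angles = counts / counts.sum() * 2 * pi

# Each wedge ends at the running total and starts where the previous one ended
end_angles = angles.cumsum()
start_angles = end_angles - angles

print(pd.DataFrame({"start": start_angles, "end": end_angles}))
```

The last wedge should end at 2π, which is a quick way to check that the angles cover the whole pie.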
p = figure(height=350, title="Pie Chart", toolbar_location=None,
tools="hover", tooltips="@country: @value", x_range=(-0.5, 1.0))
p.wedge(x=0, y=1, radius=0.4,
start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
line_color="white", fill_color='color', legend_field='country', source=data)
p.axis.axis_label = None
p.axis.visible = False
p.grid.grid_line_color = None
output_notebook()
show(p)
A little background on the data: it is based on questions created for a lunch form. The lunch form has different kinds of labels, which we will get to in a bit. The pie chart above shows the different form labels, with each wedge sized by the number of rows for that label.
Next, I create a treemap by adapting Bokeh's treemap example to the data used in this notebook.
def treemap(df, col, x, y, dx, dy, *, N=100):
    sub_df = df.nlargest(N, col)
    normed = normalize_sizes(sub_df[col], dx, dy)
    blocks = squarify(normed, x, y, dx, dy)
    blocks_df = pd.DataFrame.from_dict(blocks).set_index(sub_df.index)
    return sub_df.join(blocks_df, how='left').reset_index()
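To make the helper a bit less of a black box: as far as I understand it, `normalize_sizes` rescales the values so that they add up to the area of the target rectangle (`dx * dy`), and `squarify` then lays out one rectangle per value. The sketch below reproduces only the rescaling step in plain Python. The function name `scale_to_area` is my own, and the inputs are the per-model correct-prediction counts and the 800 x 450 canvas used later in this notebook.

```python
def scale_to_area(sizes, dx, dy):
    """Rescale sizes so that they sum to the rectangle area dx * dy."""
    total = sum(sizes)
    return [size * dx * dy / total for size in sizes]


# Per-model counts of correct predictions, largest first
counts = [62, 57, 49, 47, 41, 39]
areas = scale_to_area(counts, 800, 450)

for count, area in zip(counts, areas):
    print(f"{count:>3} correct predictions -> block area {area:,.0f}")
```

Each block's area is proportional to its count, so the largest model block takes 62/295 of the canvas.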
df_correct_prediction = df[df.correctly_predicted != False]
df_correct_prediction.head()
| | label | questions | answer | score | model | percentage | actual_answer | correctly_predicted | occurence |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Number of Attendees | what Number of Attendees? | 15 | 62.77 | model 1 | 1.539416 | 15 | True | 1 |
| 1 | Number of Attendees | who Number of Attendees? | 15 | 67.06 | model 1 | 1.644627 | 15 | True | 1 |
| 2 | Number of Attendees | where Number of Attendees? | 15 | 46.99 | model 1 | 1.152416 | 15 | True | 1 |
| 3 | Number of Attendees | when Number of Attendees? | 15 | 55.51 | model 1 | 1.361367 | 15 | True | 1 |
| 4 | Number of Attendees | why Number of Attendees? | 15 | 55.14 | model 1 | 1.352293 | 15 | True | 1 |
df_correct_prediction.shape
(295, 9)
# Sorted list of model names, reused later for the color mapping
models = sorted(df['model'].unique())
print(models)
['model 1', 'model 2', 'model 3', 'model 4', 'model 5', 'model 6']
# Sum the numeric columns per (model, label); numeric_only=True skips text columns
score_by_label = df_correct_prediction.groupby(["model", "label"]).sum(numeric_only=True)
score_by_label = score_by_label.sort_values(by="correctly_predicted").reset_index()
# Roll the per-label sums up to one row per model
score_by_model = score_by_label.groupby("model").sum(numeric_only=True).sort_values(by="correctly_predicted")
score_by_model
| model | score | percentage | correctly_predicted | occurence |
|---|---|---|---|---|
| model 3 | 2557.03 | 146.987787 | 39 | 39 |
| model 2 | 2195.60 | 90.754293 | 41 | 41 |
| model 6 | 1213.33 | 43.869068 | 47 | 47 |
| model 1 | 1035.87 | 40.455861 | 49 | 49 |
| model 5 | 1942.50 | 105.797038 | 57 | 57 |
| model 4 | 2117.84 | 102.528483 | 62 | 62 |
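The two-step aggregation above can be hard to follow on the full dataset, so here is a minimal sketch on a toy frame. The rows and values are made up; only the column names follow the notebook. It is written with an explicit `numeric_only=True`, which I assume is the intended behavior: text columns such as `questions` are dropped and the boolean `correctly_predicted` column is summed as a count.

```python
import pandas as pd

# Hypothetical rows, only the column names match the notebook's data
toy = pd.DataFrame({
    "model": ["model 1", "model 1", "model 2", "model 2"],
    "label": ["Date", "Date", "Date", "Budget"],
    "questions": ["what Date?", "when Date?", "what Date?", "what Budget?"],
    "score": [60.0, 40.0, 70.0, 30.0],
    "correctly_predicted": [True, True, True, True],
})

# Step 1: one row per (model, label), summing only the numeric columns
per_label = toy.groupby(["model", "label"]).sum(numeric_only=True).reset_index()

# Step 2: roll the per-label rows up to one row per model
per_model = per_label.groupby("model").sum(numeric_only=True)

print(per_label)
print(per_model)
```

Because `True` counts as 1, the summed `correctly_predicted` column is the number of correct predictions in each group.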
x, y, w, h = 0, 0, 800, 450
blocks_by_model = treemap(score_by_model, "correctly_predicted", x, y, w, h)
dfs = []
for index, (model, score, percentage, correctly_predicted, occurence, x, y, dx, dy) in blocks_by_model.iterrows():
    df_score = score_by_label[score_by_label.model == model]
    # print(df_score)
    dfs.append(treemap(df_score, "correctly_predicted", x, y, dx, dy, N=10))
blocks = pd.concat(dfs)
p = figure(width=w, height=h, tooltips="@label", toolbar_location=None,
x_axis_location=None, y_axis_location=None)
p.x_range.range_padding = p.y_range.range_padding = 0
p.grid.grid_line_color = None
p.block('x', 'y', 'dx', 'dy', source=blocks, line_width=1, line_color="white",
fill_alpha=0.8, fill_color=factor_cmap("model", "MediumContrast4", models))
p.text('x', 'y', x_offset=2, text="model", source=blocks_by_model,
text_font_size="18pt", text_color="white")
blocks["ytop"] = blocks.y + blocks.dy
p.text('x', 'ytop', x_offset=2, y_offset=2, text="label", source=blocks,
text_font_size="6pt", text_baseline="top",
text_color=factor_cmap("model", ("black", "white", "black", "white","black", "white"), models))
show(p)
After seeing how hard it was to set up a treemap in Bokeh compared to Plotly, I chose to do the rest of my visualizations in Plotly because it is easier to use.
I visualize how often each model predicts the correct answer for each label, to see which model performs best overall.
fig = px.treemap(df_correct_prediction, path=[px.Constant("all"), 'label', 'model', 'actual_answer'], values='occurence', title="Correct-answer occurrences per label for each model")
fig.update_traces(root_color="lightgrey", marker=dict(cornerradius=5))
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
fig = px.treemap(df_correct_prediction, path=[px.Constant("all"), 'model', 'label'], values='occurence', title="Correct-answer occurrences per model for each form label")
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
I visualize the total confidence score of each model's correct predictions for each label, to see which model performs best overall.
fig = px.treemap(df_correct_prediction, path=[px.Constant("all"), 'label', 'model', 'actual_answer'], values='score', title="Prediction confidence score per label for each model")
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
fig = px.treemap(df_correct_prediction, path=[px.Constant("all"), 'model', 'label'], values='score', title="Prediction confidence score per model for each form label")
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
fig = px.pie(df_correct_prediction, values='occurence', names='label', width=1000, height=500, color_discrete_sequence=px.colors.sequential.RdBu, title='Occurrences of Correctly Predicted Answers per Label')
fig.show()
Dot plots (also known as Cleveland dot plots) are scatter plots with one categorical axis and one continuous axis. They can be used to show changes between two (or more) points in time or between two (or more) conditions. Compared to a bar chart, dot plots can be less cluttered and allow for an easier comparison between conditions.
fig = px.scatter(score_by_label.sort_values('model'), y="label", x="correctly_predicted", color="model", symbol="model",
title='Number of Predicted Correct Answer per Label for each model')
fig.update_traces(marker_size=10)
fig.show()
I visualize how often each model predicts the correct answer for each label, this time using a horizontal bar chart, to see which model performs best overall.
fig = px.bar(score_by_label.sort_values('model'), x="correctly_predicted", y="label", color='model', orientation='h',
hover_data=["correctly_predicted", "score"],
height=400,
title='Number of Predicted Correct Answer per Label for each model')
fig.show()
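The bars can also be cross-checked numerically. The sketch below is only an illustration on made-up numbers (the frame `toy_scores` and its values are hypothetical, shaped like `score_by_label`): it pivots the per-label counts into one column per model, then reads off the best model per label and the overall totals.

```python
import pandas as pd

# Hypothetical per-(model, label) counts, shaped like score_by_label
toy_scores = pd.DataFrame({
    "model": ["model 1", "model 2", "model 1", "model 2"],
    "label": ["Date", "Date", "Budget", "Budget"],
    "correctly_predicted": [5, 3, 1, 4],
})

# One row per label, one column per model
wide = toy_scores.pivot(index="label", columns="model", values="correctly_predicted")

# Model with the most correct predictions for each label
best_per_label = wide.idxmax(axis=1)

# Total correct predictions per model, best first
totals = wide.sum().sort_values(ascending=False)

print(wide)
print(best_per_label)
print(totals)
```

A model can win overall without winning every label, which is the kind of detail the stacked bars make visible.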
I visualize the total confidence score of each model's correct predictions for each label, again using a horizontal bar chart, to see which model performs best overall.
fig = px.bar(score_by_label.sort_values('model'), x="score", y="label", color='model', orientation='h',
hover_data=["correctly_predicted", "score"],
height=400,
title='Sum of prediction confidence scores per label for each model')
fig.show()
Sunburst plots visualize hierarchical data spanning outwards radially from root to leaves. Similar to Icicle charts and Treemaps, the hierarchy is defined by labels (names for px.sunburst) and parents attributes. The root starts from the center and children are added to the outer rings.
I visualize how often each model predicts the correct answer for each label, to see which model performs best overall.
fig = px.sunburst(score_by_label, path=['label', 'model'],width=1000, height=500, values='correctly_predicted',
title='Number of Predicted Correct Answer per Label for each model')
fig.show()
I visualize the total confidence score of each model's correct predictions for each label, to see which model performs best overall.
fig = px.sunburst(score_by_label, path=['label', 'model'], values='score', width=1000, height=500, title='Prediction confidence score per label for each model')
fig.show()
Icicle charts visualize hierarchical data using rectangular sectors that cascade from root to leaves in one of four directions: up, down, left, or right. Similar to Sunburst charts and Treemap charts, the hierarchy is defined by labels (names for px.icicle) and parents attributes. Click on a sector to zoom in or out, which also displays a pathbar at the top of the icicle. To zoom out, you can click the parent sector or the pathbar.
fig = px.icicle(score_by_label, path=[px.Constant("Lunch Form"), 'label', 'model'], values='correctly_predicted',
title='Number of Predicted Correct Answer per Label for each model')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
fig = px.area(score_by_label, x="label", y="correctly_predicted", color="model", pattern_shape="model",
title='Number of Predicted Correct Answer per Label for each model')
fig.show()
fig.write_html(r"C:\Users\victo\source\repos\Semester 7\JupyterLab\Data Visualization\file.html")
To conclude, this notebook shows different plots that express the collected data using both Bokeh and Plotly. During this exercise I learned not only how to use these two tools, but also different ways to visualize my data so that it is more user-friendly and interactive, which I personally like.
plotly.offline.init_notebook_mode()